{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Comparison of Vector Stores\n",
    " - MetadataFilter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metadata Query Operators\n",
    "\n",
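    "| **Operator** | **Description**                                                        | **Supported value types** | **Example**                         |\n",
    "|--------------|------------------------------------------------------------------------|---------------------------|-------------------------------------|\n",
    "| `$eq`        | Matches vectors whose metadata value equals the specified value.       | Number, string, boolean   | `{ \"author\": { \"$eq\": \"john\" } }`  |\n",
    "| `$ne`        | Matches vectors whose metadata value does not equal the specified value. | Number, string, boolean | `{ \"author\": { \"$ne\": \"jack\" } }`  |\n",
    "| `$gt`        | Matches vectors whose metadata value is greater than the specified value. | Number                 | `{ \"age\": { \"$gt\": 30 } }`         |\n",
    "| `$gte`       | Matches vectors whose metadata value is greater than or equal to the specified value. | Number     | `{ \"age\": { \"$gte\": 30 } }`        |\n",
    "| `$lt`        | Matches vectors whose metadata value is less than the specified value. | Number                    | `{ \"age\": { \"$lt\": 30 } }`         |\n",
    "| `$lte`       | Matches vectors whose metadata value is less than or equal to the specified value. | Number        | `{ \"age\": { \"$lte\": 30 } }`        |\n",
    "| `$in`        | Matches vectors whose metadata value is in the specified array.        | String, number            | `{ \"author\": { \"$in\": [\"john\", \"jill\"] } }` |\n",
    "| `$nin`       | Matches vectors whose metadata value is not in the specified array.    | String, number            | `{ \"author\": { \"$nin\": [\"jack\"] } }` |\n",
    "| `$exists`    | Matches vectors that have the specified metadata field.                | Boolean                   | `{ \"author\": { \"$exists\": true } }` |\n"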
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Number of requested results 10 is greater than number of elements in index 3, updating n_results = 3\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'ids': [['1', '3']], 'distances': [[0.2882421016693115, 1.0175083875656128]], 'metadatas': [[{'author': 'john'}, {'author': 'jill'}]], 'embeddings': None, 'documents': [['Article by john', 'Article by Jill']], 'uris': None, 'data': None, 'included': ['metadatas', 'documents', 'distances']}\n",
      "{'ids': ['1', '3'], 'embeddings': None, 'metadatas': [{'author': 'john'}, {'author': 'jill'}], 'documents': ['Article by john', 'Article by Jill'], 'uris': None, 'data': None, 'included': ['metadatas', 'documents']}\n"
     ]
    }
   ],
   "source": [
    "import chromadb\n",
    "from chromadb.utils import embedding_functions\n",
    "\n",
    "# Embed documents and queries locally with a SentenceTransformers model.\n",
    "sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(\n",
    "    model_name=\"all-MiniLM-L6-v2\"\n",
    ")\n",
    "\n",
    "# In-memory client: the collection does not outlive the Python process.\n",
    "client = chromadb.Client()\n",
    "collection = client.get_or_create_collection(\n",
    "    \"test-where-list\", embedding_function=sentence_transformer_ef\n",
    ")\n",
    "collection.add(\n",
    "    documents=[\"Article by john\", \"Article by Jack\", \"Article by Jill\"],\n",
    "    metadatas=[{\"author\": \"john\"}, {\"author\": \"jack\"}, {\"author\": \"jill\"}],\n",
    "    ids=[\"1\", \"2\", \"3\"],\n",
    ")\n",
    "\n",
    "# Similarity search restricted to documents whose author is john or jill.\n",
    "# n_results is larger than the collection, so Chroma lowers it to 3 and warns.\n",
    "query = [\"Give me articles by john\"]\n",
    "res = collection.query(\n",
    "    query_texts=query,\n",
    "    where={\"author\": {\"$in\": [\"john\", \"jill\"]}},\n",
    "    n_results=10,\n",
    ")\n",
    "print(res)\n",
    "\n",
    "# get() applies the same metadata filter without any similarity ranking.\n",
    "res_get = collection.get(where={\"author\": {\"$in\": [\"john\", \"jill\"]}})\n",
    "print(res_get)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'ids': [['1', '3']],\n",
      " 'distances': [[0.2882421016693115, 1.0175083875656128]],\n",
      " 'metadatas': [[{'author': 'john'}, {'author': 'jill'}]],\n",
      " 'embeddings': None,\n",
      " 'documents': [['Article by john', 'Article by Jill']],\n",
      " 'uris': None,\n",
      " 'data': None,\n",
      " 'included': ['metadatas', 'documents', 'distances']}\n"
     ]
    }
   ],
   "source": [
    "import pprint\n",
    "\n",
    "# query() results are nested: one inner list per query text.\n",
    "pprint.pp(res)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'ids': ['1', '3'],\n",
      " 'embeddings': None,\n",
      " 'metadatas': [{'author': 'john'}, {'author': 'jill'}],\n",
      " 'documents': ['Article by john', 'Article by Jill'],\n",
      " 'uris': None,\n",
      " 'data': None,\n",
      " 'included': ['metadatas', 'documents']}\n"
     ]
    }
   ],
   "source": [
    "# get() results are flat lists, since there is no per-query grouping.\n",
    "pprint.pp(res_get)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'ids': [['1']],\n",
       " 'distances': [[0.2882421016693115]],\n",
       " 'metadatas': [[{'article_type': 'blog', 'author': 'john'}]],\n",
       " 'embeddings': None,\n",
       " 'documents': [['Article by john']],\n",
       " 'uris': None,\n",
       " 'data': None,\n",
       " 'included': ['metadatas', 'documents', 'distances']}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
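    "# Upsert the same ids so each document also carries an article_type field.\n",
    "collection.upsert(\n",
    "    documents=[\"Article by john\", \"Article by Jack\", \"Article by Jill\"],\n",
    "    metadatas=[{\"author\": \"john\", \"article_type\": \"blog\"},\n",
    "               {\"author\": \"jack\", \"article_type\": \"social\"},\n",
    "               {\"author\": \"jill\", \"article_type\": \"paper\"}],\n",
    "    ids=[\"1\", \"2\", \"3\"])\n",
    "\n",
    "# $and requires both conditions: author is john or jill, and article_type is blog.\n",
    "where = {\"$and\": [{\"author\": {\"$in\": [\"john\", \"jill\"]}}, {\"article_type\": {\"$eq\": \"blog\"}}]}\n",
    "collection.query(query_texts=query, where=where, n_results=3)"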
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## BM25"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package stopwords to\n",
      "[nltk_data]     /home/blackink/nltk_data...\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tokens: ['artifici', 'intellig', 'found', 'academ', 'disciplin', '1956']\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data]   Unzipping corpora/stopwords.zip.\n"
     ]
    }
   ],
   "source": [
    "from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer\n",
    "from pymilvus.model.sparse import BM25EmbeddingFunction\n",
    "\n",
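    "# Default English analyzer; the printed tokens show it lowercases, drops stopwords and punctuation, and stems.\n",
    "analyzer = build_default_analyzer(language=\"en\")\n",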
    "\n",
    "corpus = [\n",
    "    \"Artificial intelligence was founded as an academic discipline in 1956.\",\n",
    "    \"Alan Turing was the first person to conduct substantial research in AI.\",\n",
    "    \"Born in Maida Vale, London, Turing was raised in southern England.\",\n",
    "]\n",
    "\n",
    "# Inspect the tokens produced for the first corpus document.\n",
    "tokens = analyzer(corpus[0])\n",
    "print(\"tokens:\", tokens)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "bm25_ef = BM25EmbeddingFunction(analyzer)\n",
    "\n",
    "# Fit on the corpus to build the vocabulary and collect the corpus statistics BM25 needs.\n",
    "bm25_ef.fit(corpus)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Embeddings: <Compressed Sparse Row sparse array of dtype 'float32'\n",
      "\twith 24 stored elements and shape (5, 21)>\n",
      "  Coords\tValues\n",
      "  (0, 0)\t1.0208816528320312\n",
      "  (0, 1)\t1.0208816528320312\n",
      "  (0, 3)\t1.0208816528320312\n",
      "  (0, 5)\t1.0208816528320312\n",
      "  (1, 0)\t0.960698664188385\n",
      "  (1, 1)\t0.960698664188385\n",
      "  (1, 6)\t0.960698664188385\n",
      "  (1, 7)\t0.960698664188385\n",
      "  (1, 10)\t0.960698664188385\n",
      "  (1, 12)\t0.960698664188385\n",
      "  (2, 7)\t0.907216489315033\n",
      "  (2, 15)\t0.907216489315033\n",
      "  (2, 16)\t0.907216489315033\n",
      "  (2, 17)\t0.907216489315033\n",
      "  (2, 19)\t0.907216489315033\n",
      "  (2, 20)\t0.907216489315033\n",
      "  (3, 0)\t1.089108943939209\n",
      "  (3, 1)\t1.089108943939209\n",
      "  (3, 5)\t1.089108943939209\n",
      "  (4, 7)\t0.960698664188385\n",
      "  (4, 15)\t0.960698664188385\n",
      "  (4, 16)\t0.960698664188385\n",
      "  (4, 17)\t0.960698664188385\n",
      "  (4, 20)\t0.960698664188385\n",
      "Sparse dim: 21 (21,)\n"
     ]
    }
   ],
   "source": [
    "docs = [\n",
    "    \"The field of artificial intelligence was established as an academic subject in 1956.\",\n",
    "    \"Alan Turing was the pioneer in conducting significant research in artificial intelligence.\",\n",
    "    \"Originating in Maida Vale, London, Turing grew up in the southern regions of England.\",\n",
    "    \"In 1956, artificial intelligence emerged as a scholarly field.\",\n",
    "    \"Turing, originally from Maida Vale, London, was brought up in the south of England.\"\n",
    "]\n",
    "\n",
    "# Encode each document as one sparse row with one column per vocabulary term.\n",
    "docs_embeddings = bm25_ef.encode_documents(docs)\n",
    "\n",
    "print(\"Embeddings:\", docs_embeddings)\n",
    "print(\"Sparse dim:\", bm25_ef.dim, list(docs_embeddings)[0].shape)\n"
   ]
  },
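  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The document weights printed above are consistent with the term-frequency part of the standard BM25 formula. This is a reading of the numbers under the assumed parameters $k_1 = 1.5$ and $b = 0.75$, not a statement taken from the pymilvus documentation:\n",
    "\n",
    "$$w(t, d) = \\frac{f(t, d)\\,(k_1 + 1)}{f(t, d) + k_1\\left(1 - b + b\\,\\frac{|d|}{\\mathrm{avgdl}}\\right)}$$\n",
    "\n",
    "Here $f(t, d)$ is the count of term $t$ in document $d$, $|d|$ is the number of tokens the analyzer keeps for $d$, and $\\mathrm{avgdl}$ is the average token count over the fitted corpus. Assuming the three corpus sentences analyze to 6, 8 and 8 tokens, $\\mathrm{avgdl} = 22/3$. For the first entry of `docs`, assuming it analyzes to 7 tokens that each occur once:\n",
    "\n",
    "$$w = \\frac{1 \\cdot 2.5}{1 + 1.5\\left(0.25 + 0.75 \\cdot \\frac{7}{22/3}\\right)} \\approx 1.0209$$\n",
    "\n",
    "which matches the value printed for row 0. The same arithmetic with $|d| = 8$, $9$ and $6$ gives about $0.9607$, $0.9072$ and $1.0891$, the values printed for rows 1, 2 and 3."
   ]
  },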
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Embeddings: <Compressed Sparse Row sparse array of dtype 'float32'\n",
      "\twith 6 stored elements and shape (2, 21)>\n",
      "  Coords\tValues\n",
      "  (0, 0)\t0.5108256340026855\n",
      "  (0, 1)\t0.5108256340026855\n",
      "  (0, 2)\t0.5108256340026855\n",
      "  (1, 6)\t0.5108256340026855\n",
      "  (1, 7)\t0.115543894469738\n",
      "  (1, 14)\t0.5108256340026855\n",
      "Sparse dim: 21 (21,)\n"
     ]
    }
   ],
   "source": [
    "queries = [\"When was artificial intelligence founded\",\n",
    "           \"Where was Alan Turing born?\"]\n",
    "\n",
    "# Encode the queries against the same 21-term vocabulary learned in fit().\n",
    "query_embeddings = bm25_ef.encode_queries(queries)\n",
    "\n",
    "print(\"Embeddings:\", query_embeddings)\n",
    "print(\"Sparse dim:\", bm25_ef.dim, list(query_embeddings)[0].shape)\n"
   ]
  },
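  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The query weights printed above are consistent with an inverse document frequency computed on the fitted corpus. Again this is inferred from the numbers, assuming $N = 3$ corpus documents and a floor parameter $\\epsilon = 0.25$:\n",
    "\n",
    "$$\\mathrm{idf}(t) = \\ln\\frac{N - n_t + 0.5}{n_t + 0.5}$$\n",
    "\n",
    "where $n_t$ is the number of corpus documents containing $t$. A term that occurs in one of the three documents gets $\\ln(2.5 / 1.5) \\approx 0.5108$, the value shown for most entries. The entry at column 7 is smaller. Assuming that column is the stem of \"Turing\", which occurs in two corpus documents, the formula gives the negative value $\\ln(1.5 / 2.5)$, and the printed $0.1155$ equals $\\epsilon$ times the mean idf over the 21 vocabulary terms:\n",
    "\n",
    "$$0.25 \\cdot \\frac{20 \\cdot 0.5108 - 0.5108}{21} \\approx 0.1155$$\n",
    "\n",
    "If query and document vectors are compared by inner product, the usual choice for sparse BM25 vectors, the score of document $d$ for query $q$ is\n",
    "\n",
    "$$\\mathrm{score}(q, d) = \\sum_{t \\in q} \\mathrm{idf}(t)\\, w(t, d)$$\n",
    "\n",
    "where $w(t, d)$ is the document-side weight stored in the row for $d$. For example, the first query and the fourth document share columns 0 and 1, which gives $2 \\cdot 0.5108 \\cdot 1.0891 \\approx 1.113$."
   ]
  },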
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
